SWE-bench: add Slurm tests & other improvements#920
Conversation
Signed-off-by: Nikolai Ludwig <nliudvig@nvidia.com>
Walkthrough

Adds configurable retry logic and harness repo/commit options to the SWE-bench evaluation code, updates the documentation accordingly, and introduces Slurm test orchestration for Qwen3-Coder 30B with metric checks. Also updates the Slurm README and run_all.sh to include the new test.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Task as SweBenchGenerationTask
    participant C as Container
    participant Git as Git Repos
    rect rgba(230,245,255,0.5)
        note over Task: Setup repos per cfg
        Task->>Git: clone SWE-bench / agent / eval harness (cfg repos)
        Task->>Git: checkout cfg commits/tags/branches
    end
    loop For each command
        rect rgba(240,255,240,0.6)
            note over Task: Retry with randomized backoff
            Task->>C: Execute command (timeout=T)
            alt Success
                C-->>Task: Expected file(s) match pattern
            else Failure or missing output
                Task->>Task: Sleep U[min_retry_interval, max_retry_interval]
                note over Task: Retry until cfg.max_retries exhausted
            end
        end
    end
    note over Task: Proceed to evaluation using cfg harness and paths
```
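The randomized-backoff retry loop in the diagram above can be sketched roughly as follows. Note this is an illustrative sketch, not the actual code in `swebench.py`: the function name, the `output_ready` callback, and the default values are assumptions based on the diagram and config option names.

```python
import random
import subprocess
import time


def run_with_retries(cmd, output_ready, max_retries=3,
                     min_retry_interval=5.0, max_retry_interval=30.0,
                     timeout=600):
    """Run `cmd`, retrying with a uniformly random backoff until
    `output_ready()` reports the expected file(s) or retries run out."""
    for attempt in range(max_retries + 1):
        try:
            subprocess.run(cmd, shell=True, timeout=timeout, check=True)
            if output_ready():  # success: expected file(s) match the pattern
                return True
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            pass  # treat as a failed attempt and fall through to backoff
        if attempt < max_retries:
            # sleep U[min_retry_interval, max_retry_interval] as in the diagram
            time.sleep(random.uniform(min_retry_interval, max_retry_interval))
    return False
```

The `random.uniform` call matches the `U[min, max]` sleep in the diagram; a non-cryptographic generator is fine for jittering retries, which is why the S311 lint hint flagged below is safe to ignore here.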
```mermaid
sequenceDiagram
    autonumber
    participant CLI as run_test.py
    participant Prep as prepare_data
    participant Slurm as Slurm Scheduler
    participant Eval as eval (per framework)
    participant Check as check_results.py
    CLI->>Prep: prepare_data (container_formatter)
    loop For agent_framework in [openhands, swe_agent]
        CLI->>Slurm: submit Eval job (expname_prefix, model/server args)
        Slurm-->>Eval: run
        CLI->>Slurm: submit Check job (after: Eval)
        Slurm-->>Check: run (loads metrics.json, assert ranges)
    end
```
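The "submit Check job (after: Eval)" step relies on Slurm job dependencies. As a rough sketch of how such a chain could be built directly with `sbatch` (the script names are hypothetical, and nemo_skills' own `add_task`/`run_after` machinery wraps this differently):

```python
import subprocess


def build_sbatch_cmd(script, after_job_id=None):
    """Build an sbatch command line, optionally gated on another job
    finishing successfully via --dependency=afterok."""
    cmd = ["sbatch", "--parsable"]  # --parsable prints just the job id
    if after_job_id is not None:
        cmd.append(f"--dependency=afterok:{after_job_id}")
    cmd.append(script)
    return cmd


def submit(script, after_job_id=None):
    """Submit and return the new job id (requires a Slurm cluster)."""
    out = subprocess.run(build_sbatch_cmd(script, after_job_id),
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()


# On a cluster, the eval -> check chain from the diagram would look like:
# for framework in ["openhands", "swe_agent"]:
#     eval_id = submit(f"eval_{framework}.sbatch")
#     submit(f"check_{framework}.sbatch", after_job_id=eval_id)
```

With `afterok`, the check job only starts once the eval job exits successfully, which is what lets `check_results.py` assume `metrics.json` exists.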
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 2
🧹 Nitpick comments (2)
tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (2)
36-36: Clarify the comment about dependent jobs.

The comment states "automatically rerun 2 times," but `dependent_jobs=2` schedules 2 dependent jobs, resulting in 3 total runs (1 original + 2 dependents). Consider revising the comment to: "automatically schedule 2 dependent reruns (3 total runs)."
37-37: Update comment for clarity.

The comment mentions "second run (swe-agent)" but this function is invoked for both `openhands` and `swe_agent` frameworks. The described issue with reading config files from absolute cluster paths applies to both frameworks, not specifically the second framework.

Apply this diff:

```diff
- reuse_code=False,  # otherwise the second run (swe-agent) tries to read the config file from the absolute cluster path and fails
+ reuse_code=False,  # otherwise subsequent runs try to read the config file from the absolute cluster path and fail
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- docs/evaluation/code.md (3 hunks)
- nemo_skills/inference/eval/swebench.py (10 hunks)
- tests/slurm-tests/README.md (1 hunks)
- tests/slurm-tests/qwen3coder_30b_swebench/check_results.py (1 hunks)
- tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (1 hunks)
- tests/slurm-tests/run_all.sh (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/slurm-tests/qwen3coder_30b_swebench/check_results.py (2)
tests/slurm-tests/utils.py (4)
- assert_all (46-58)
- get_nested_value (24-30)
- load_json (18-21)
- soft_assert (36-43)

tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (1)
- main (45-86)
nemo_skills/inference/eval/swebench.py (1)
nemo_skills/inference/chat_interface/core.py (1)
- cfg (181-182)
tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (3)
nemo_skills/pipeline/run_cmd.py (1)
- run_cmd (40-211)

nemo_skills/pipeline/cli.py (1)
- wrap_arguments (45-54)

tests/slurm-tests/qwen3coder_30b_swebench/check_results.py (1)
- main (44-52)
🪛 Ruff (0.13.3)
tests/slurm-tests/qwen3coder_30b_swebench/check_results.py
22-22: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
nemo_skills/inference/eval/swebench.py
253-253: Standard pseudo-random generators are not suitable for cryptographic purposes
(S311)
264-266: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🪛 Shellcheck (0.11.0)
tests/slurm-tests/run_all.sh
[warning] 15-15: This { is literal. Check expression (missing ;/\n?) or quote it.
(SC1083)
[warning] 15-15: This } is literal. Check expression (missing ;/\n?) or quote it.
(SC1083)
🔇 Additional comments (1)
tests/slurm-tests/qwen3coder_30b_swebench/run_test.py (1)
85-85: To inspect how `add_task` handles `run_after`, run:

```shell
#!/bin/bash
rg -nP "def add_task" -C3 nemo_skills/pipeline
```
```python
for category_tuple, expected_range in METRIC_RANGES[agent_framework].items():
    val = float(get_nested_value(data, category_tuple))
    lo, hi = expected_range
    soft_assert(lo <= val <= hi, f"swe-bench ({agent_framework}) {category_tuple}={val} out of range [{lo},{hi}]")
```
Handle missing or non-numeric metrics without crashing
`get_nested_value` returns `None` when the metric is absent (e.g., evaluation failed), so `float(None)` raises and the script aborts before calling `assert_all()`. We should soft-assert in that case to report a clear failure instead of throwing a traceback.
```diff
-        val = float(get_nested_value(data, category_tuple))
-        lo, hi = expected_range
-        soft_assert(lo <= val <= hi, f"swe-bench ({agent_framework}) {category_tuple}={val} out of range [{lo},{hi}]")
+        raw_val = get_nested_value(data, category_tuple)
+        if raw_val is None:
+            soft_assert(False, f"swe-bench ({agent_framework}) {category_tuple} missing")
+            continue
+        try:
+            val = float(raw_val)
+        except (TypeError, ValueError):
+            soft_assert(False, f"swe-bench ({agent_framework}) {category_tuple}={raw_val} not numeric")
+            continue
+        lo, hi = expected_range
+        soft_assert(lo <= val <= hi, f"swe-bench ({agent_framework}) {category_tuple}={val} out of range [{lo},{hi}]")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for category_tuple, expected_range in METRIC_RANGES[agent_framework].items():
    raw_val = get_nested_value(data, category_tuple)
    if raw_val is None:
        soft_assert(False, f"swe-bench ({agent_framework}) {category_tuple} missing")
        continue
    try:
        val = float(raw_val)
    except (TypeError, ValueError):
        soft_assert(False, f"swe-bench ({agent_framework}) {category_tuple}={raw_val} not numeric")
        continue
    lo, hi = expected_range
    soft_assert(lo <= val <= hi,
                f"swe-bench ({agent_framework}) {category_tuple}={val} out of range [{lo},{hi}]")
```
doesn't really matter here, we're only checking one metric value, so it can only fail once regardless.
also we don't bother with this in the other check_results.py scripts...
@ludwig-n, that's fair. Given there's only one metric being checked and the existing pattern in other check_results.py scripts, the additional error handling isn't necessary. I appreciate the clarification!
```python
benchmarks="swe-bench",
num_chunks=8,
dependent_jobs=2,  # automatically rerun 2 times because it's common for some instances to fail
reuse_code=False,  # otherwise the second run (swe-agent) tries to read the config file from the absolute cluster path and fails
```
Could you share the error, please? This shouldn't be happening.
Merging for now, but we should address this in another PR.
Signed-off-by: dgitman <dgitman@nvidia.com>
This PR adds Slurm tests for SWE-bench. Specifically, the Qwen3-Coder 30B model is evaluated with OpenHands and SWE-agent.
It also adds the following features & fixes:
Summary by CodeRabbit

- New Features
- Documentation
- Tests